Non-emoji numerals are detected as emoji #3

Open
kainosnoema wants to merge 1 commit into toddkramer:master from cotap:handle-composed-sequences

Conversation

@kainosnoema
Contributor

@kainosnoema kainosnoema commented Jul 22, 2016

Non-emoji numerals are treated as emoji. e.g. this fails:

XCTAssertFalse("1234567890".containsEmoji())

This is because `String.unicodeScalars` splits emoji into their codepoints, which for some characters yields standard ASCII. As an example, here are the codepoints for the "0 in a box" emoji:

```
- [1065] : "0"
- [1066] : "\u{FE0F}"
- [1067] : "\u{20E3}"
```

It's not as easy as removing ASCII characters from the list of unicode scalars, since that would break the implementation of containsEmojiOnly(). One solution would be to find a way to split strings into their composed character sequences, but then you'd have to also combine all possible modifier permutations. Still thinking of the proper way to solve this.
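The decomposition above can be reproduced directly. A minimal sketch (keycap-zero written out with its explicit scalars, so the example is copy-pasteable):

```swift
import Foundation

// Sketch reproducing the issue described above: the keycap-zero emoji
// is the ASCII digit "0" followed by a variation selector (U+FE0F)
// and the combining enclosing keycap (U+20E3).
let keycapZero = "0\u{FE0F}\u{20E3}" // "0️⃣"

let scalars = keycapZero.unicodeScalars.map { String(format: "U+%04X", $0.value) }
print(scalars) // ["U+0030", "U+FE0F", "U+20E3"]
// The first scalar is plain ASCII "0", so any scalar-based emoji set
// built from this sequence would also match ordinary digit strings.
```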

@kainosnoema
Contributor Author

Progress: it seems the only way to properly detect a sequence of codepoints is with enumerateSubstringsInRange(startIndex..<endIndex, options: .ByComposedCharacterSequences). Using this method, we can break both the emoji set and the input string into composed character sequences, then compare them directly.

The one hitch to this solution is the one I mentioned about modifiers, but that can be handled by checking if the sequence is made up of two codepoints, the first one being an emoji and the second one being a modifier.

I'm working on a pull request now with this approach, adding tests as I go.
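A minimal sketch of that approach, using the current Swift spelling of the API mentioned above (`enumerateSubstrings(in:options:)`); the `emojiSequences` set here is a hypothetical stand-in for the precomputed set of composed emoji sequences:

```swift
import Foundation

// Hypothetical precomputed set of composed emoji character sequences;
// the real implementation would enumerate all known emoji the same way.
let emojiSequences: Set<String> = ["0\u{FE0F}\u{20E3}", "\u{1F600}"]

func containsEmoji(_ string: String) -> Bool {
    var found = false
    // Walk the input one composed character sequence at a time, so a
    // keycap digit is seen as a single unit rather than three scalars.
    string.enumerateSubstrings(in: string.startIndex..<string.endIndex,
                               options: .byComposedCharacterSequences) { substring, _, _, stop in
        if let sequence = substring, emojiSequences.contains(sequence) {
            found = true
            stop = true
        }
    }
    return found
}

print(containsEmoji("1234567890"))        // false: bare digits never match
print(containsEmoji("0\u{FE0F}\u{20E3}")) // true: the full keycap sequence matches
```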

Due to the use of `unicodeScalars` previously, some ASCII characters
were being identified as emoji. In particular, the "Keycap Digit N"
characters are composed of the ASCII character followed by two other
codepoints. Keycap Digit Zero, for example, contains these scalars:

```
- [1065] : "0"
- [1066] : "\u{FE0F}"
- [1067] : "\u{20E3}"
```

In order to properly handle these sequences without false positives, we
have to split emojis into their composed character sequences and store
those as a set instead. The one complication here is that there are many
permutations of emoji with the skin tone modifiers. Instead of storing
each of these, we simply check if a character sequence has two
codepoints, and if so, that the first character is an emoji and the
second is a skin tone modifier. This is a fairly simple and efficient
way to accurately identify the presence of valid emoji.

Signed-off-by: Evan Owen <kainosnoema@gmail.com>
@kainosnoema
Contributor Author

kainosnoema commented Jul 22, 2016

Alright, here's a stab at fixing things. It requires a dramatically different approach to emoji detection, but it seems to be the most straightforward way to accurately detect emoji without false positives on ASCII digits. Performance is good too after the first enumeration of all emoji sequences.

Because it's so different though, you may have some suggestions on how to improve.

Edit: The other major change here is that I've removed UnicodeScalar.isEmoji(), since that doesn't really make sense: many emoji are made up of multiple UnicodeScalars. It's a breaking API change, but maybe one that wasn't intended to be used?
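The skin-tone handling described earlier can be sketched as a small check. This is an illustrative sketch, not the PR's actual code: `emojiScalars` is a hypothetical set of single-scalar emoji, and the Fitzpatrick skin-tone modifiers occupy U+1F3FB through U+1F3FF:

```swift
import Foundation

// Sketch of the two-codepoint modifier check described above. Instead of
// storing every skin-tone permutation, treat a two-scalar sequence as
// emoji when the first scalar is a known emoji and the second is a
// Fitzpatrick skin-tone modifier.
let emojiScalars: Set<UInt32> = [0x1F44B] // hypothetical set; waving hand as an example

func isSkinToneModifier(_ scalar: UnicodeScalar) -> Bool {
    // Fitzpatrick modifiers: U+1F3FB ... U+1F3FF
    return scalar.value >= 0x1F3FB && scalar.value <= 0x1F3FF
}

func isEmojiWithModifier(_ sequence: String) -> Bool {
    let scalars = Array(sequence.unicodeScalars)
    return scalars.count == 2
        && emojiScalars.contains(scalars[0].value)
        && isSkinToneModifier(scalars[1])
}

print(isEmojiWithModifier("\u{1F44B}\u{1F3FD}")) // true: waving hand + medium skin tone
print(isEmojiWithModifier("\u{1F44B}"))          // false: no modifier present
```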
